Auto-tuning a Matrix Routine for High Performance
Authors
Abstract
Well-written scientific simulations typically gain substantial performance by using highly optimized library routines. Some of the most fundamental of these routines perform matrix-matrix multiplication and related operations, collectively known as BLAS (Basic Linear Algebra Subprograms). Optimizing these library routines is therefore of great importance for many scientific simulations. In fact, some of them are often hand-optimized in assembly language for a given processor to get the best possible performance. In this paper, we present a new tuning approach that combines a small snippet of assembly code with an auto-tuner. For our preliminary test case, the symmetric rank-2 update, the resulting routine outperforms the best auto-tuner and vendor-supplied code on our target machine, an Intel quad-core processor. It is also less than 1.2% slower than the best hand-coded library. Our novel approach shows considerable promise for further performance gains on modern multi-core and many-core processors.
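To make the test case concrete, here is a minimal sketch of a symmetric rank-2 update and of the basic idea behind an auto-tuner. It is not the paper's implementation. It assumes the level-2 BLAS form A := alpha*x*y^T + alpha*y*x^T + A applied to the lower triangle (the abstract does not say which variant is used), and the function names and the `block` parameter are hypothetical, chosen only for illustration.

```python
import time


def syr2_lower(alpha, x, y, a):
    """Reference update A := alpha*x*y^T + alpha*y*x^T + A (lower triangle only)."""
    n = len(x)
    for i in range(n):
        row = a[i]
        xi, yi = alpha * x[i], alpha * y[i]
        for j in range(i + 1):
            row[j] += xi * y[j] + yi * x[j]
    return a


def syr2_lower_blocked(alpha, x, y, a, block):
    """Same update, visiting columns in chunks of `block` (hypothetical tuning knob)."""
    n = len(x)
    for j0 in range(0, n, block):
        # Rows above j0 have no lower-triangle entries in this column chunk.
        for i in range(j0, n):
            row = a[i]
            xi, yi = alpha * x[i], alpha * y[i]
            for j in range(j0, min(j0 + block, i + 1)):
                row[j] += xi * y[j] + yi * x[j]
    return a


def autotune(n=64, candidates=(4, 8, 16, 32), repeats=3):
    """Time each candidate block size and return the fastest one on this machine."""
    x = [float(i + 1) for i in range(n)]
    y = [float(n - i) for i in range(n)]
    best_block, best_time = None, float("inf")
    for block in candidates:
        for _ in range(repeats):
            a = [[0.0] * n for _ in range(n)]
            start = time.perf_counter()
            syr2_lower_blocked(1.0, x, y, a, block)
            elapsed = time.perf_counter() - start
            if elapsed < best_time:
                best_block, best_time = block, elapsed
    return best_block
```

Calling `autotune()` returns whichever candidate block size ran fastest in the timing loop. A real auto-tuner searches a much larger space of code variants, and the approach described in the abstract additionally relies on a hand-written assembly snippet, which this sketch does not attempt to model.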
Similar resources
A Methodology for Automatically Tuned Parallel Tridiagonalization on Distributed Memory Vector-parallel Machines
In this paper, we describe an auto-tuning methodology for parallel tridiagonalization that attains high performance. By searching for the set of three parameters that gives the best performance, a highly efficient routine can be obtained automatically. An evaluation of the methodology on distributed memory parallel machines, the HITACHI SR2201 and HITACHI SR8000, is provided. The experimental res...
A Note on Auto-tuning GEMM for GPUs
The development of high performance dense linear algebra (DLA) critically depends on highly optimized BLAS, and especially on the matrix multiplication routine (GEMM). This is especially true for Graphics Processing Units (GPUs), as evidenced by recently published results on DLA for GPUs that rely on highly optimized GEMM. However, the current best GEMM performance, e.g. of up to 375 GFlop/s in...
Offline Auto-Tuning of a PID Controller Using Extended Classifier System (XCS) Algorithm
Proportional + Integral + Derivative (PID) controllers are widely used in engineering applications such that more than half of the industrial controllers are PID controllers. There are many methods for tuning the PID parameters in the literature. In this paper an intelligent technique based on eXtended Classifier System (XCS) is presented to tune the PID controller parameters. The PID controlle...
Auto-tuning of level 1 and level 2 BLAS for GPUs
The use of high performance libraries for dense linear algebra operations is of great importance in many numerical scientific applications. The most common operations form the backbone of the Basic Linear Algebra Subroutines (BLAS) library. In this paper, we consider the performance and auto-tuning of level 1 and level 2 BLAS routines on GPUs. As examples, we develop single-precision CUDA kerne...
Auto-tuning the Matrix Powers Kernel with SEJITS
The matrix powers kernel, used in communication-avoiding Krylov subspace methods, requires runtime auto-tuning for best performance. We demonstrate how the SEJITS (Selective Embedded Just-InTime Specialization) approach can be used to deliver a high-performance and performance-portable implementation of the matrix powers kernel to application authors, while separating their high-level concerns ...